01 - Prediction

Pre-Class Code Assignment Instructions

In this semester, I am going to ask you to do a fair bit of work before coming to class. This will make our class time shorter, more manageable, and hopefully less boring.

I am also going to use this as an opportunity for you to directly earn grade points for your effort/labor, rather than “getting things right” on an exam.

Therefore, I will ask you to work through the posted slides on Canvas before class. Throughout the document, I will post Pre-Class Questions for you to work through in R. These will look like this:

Pre-Class Q1

In R, please write code that will read in the .csv from Canvas called sf_listings_2312.csv. Assign this the name bnb.

You will then write your answer in a .r script:

Click to show code and output
# Q1
bnb <- read.csv("sf_listings_2312.csv")

Important:

To earn full points, you need to organize your code correctly. Specifically, you need to:

  • Answer questions in order.
    • If you answer them out of order, just re-arrange the code after.
  • Preface each answer with a comment (# Q1/# Q2/# Q3) that indicates exactly which question you are answering.
    • Please just write the letter Q and the number in this comment.
  • Make sure your code runs on its own, on anyone’s computer.
    • To do this, I would always include rm(list = ls()) at the top of every .r script. This clears everything from your environment, letting you check that your script runs on its own (as it will need to on my computer).

Handing this in:

  • You must submit this to Canvas before 9:00am on the day of class. Even if class starts at 10:00am that day, these are always due at 9:00.
  • You must submit this code as a .txt file. This is because Canvas cannot present .R files to me in SpeedGrader. To save as .txt:
    • Click File -> New File -> Text File
    • Copy and paste your completed code to that new text file.
    • Save the file as firstname_lastname_module.txt
      • For example, my file for Module 01 would be matt_meister_01.txt
      • My file for module 05 would be matt_meister_05.txt

Grading:

  • I will grade these for completion.
  • You will receive 1 point for every question you give an honest attempt to answer.
  • Your grade will be the number of questions you answer, divided by the total number of questions.
    • This is why it is important that you number each answer with # Q1.
    • Any questions that are not numbered this way will be graded incomplete, because I can’t find them.
  • You will receive a 25% penalty for submitting these late.
  • I will post my solutions after class.

Prediction

Load in these packages. If you do not have them, you will need to install them.

  • e.g., install.packages("tidytext")
Click to show code and output
library(tidytext)
library(stringr)
library(dplyr)
library(ggplot2)

And read in these data:

Click to show code and output
bnb <- read.csv("sf_listings_2312.csv")

Why are we doing this?

The goal of data analysis/science is to make predictions about the future.

We do not just analyze data to find some neat facts about Airbnb properties (for example). We analyze data to understand how the world works. Because that will give us our best guess about how the world will work in the future.

And we have some control over the future! The things we discover about the past are helpful only because they can inform the choices we make for the future.

For example:

One common praise of Airbnbs is that they are often larger than hotel rooms, allowing friends to stay together. If it is true (that staying together is a benefit of Airbnbs), we should find that Airbnbs that accommodate more people get higher ratings than Airbnbs that accommodate fewer people.

Pre-Class Q2

In R, test with linear regression whether listings that accommodate more people receive higher ratings.

Note: The variable for average rating is called review_scores_rating.

Click to show code and output
summary(lm(data = bnb, review_scores_rating ~ accommodates))

Call:
lm(formula = review_scores_rating ~ accommodates, data = bnb)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.8139 -0.0364  0.1267  0.2130  0.2936 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.679486   0.010714 436.767   <2e-16 ***
accommodates 0.026885   0.002819   9.537   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4394 on 6173 degrees of freedom
  (1881 observations deleted due to missingness)
Multiple R-squared:  0.01452,   Adjusted R-squared:  0.01436 
F-statistic: 90.95 on 1 and 6173 DF,  p-value: < 2.2e-16

Pre-Class Q3

Do listings that accommodate more people receive higher ratings? Answer this in a comment.

Click to show code and output
# Yes. The coefficient is > 0, and the p-value is very low. This suggests that places that accommodate more people receive significantly higher ratings.

This result tells us something about the future because the p-value indicates that the relationship between review_scores_rating and accommodates is stronger than we would expect from chance alone. Therefore, it is unlikely that larger places received higher ratings merely by chance.

One condition that we have to satisfy in order to use past data to predict the future is that the results we observe must not be due to chance, because chance is random and will not repeat itself. This is easy to test by looking at p-values. We have just done this for a single variable (accommodates), but we can also do it for multiple variables.
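To see what a p-value guards against, here is a quick illustration with simulated data (not from bnb; the names x and y are made up): when two variables are pure noise, the p-value will usually be large, warning us that any apparent relationship is plausibly due to chance.

```r
set.seed(1)
x <- rnorm(100) # random noise
y <- rnorm(100) # more random noise, generated independently of x
# Because y has nothing to do with x, the p-value on x will usually be
# large, flagging any apparent relationship as plausibly chance
summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
```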

Model Comparison

The world is a complicated place. While we often care about individual variables, it is more common that we want to create (semi-) complex models, which use multiple variables to predict an outcome. To know whether adding this complexity is worth it, or which of a series of models will best predict the future, we have to compare those models.

We will cover two forms of model comparison. First, using anova(). Second, splitting our data into training and testing sets.

Anova

Anova is an acronym for analysis of variance. It takes two nested models and tests whether the additional variance explained by the more complex model is larger than we would expect by chance.
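For instance, here is a minimal sketch of anova() on simulated data (the names x1, x2, and outcome are made up for illustration; this is not the assignment):

```r
set.seed(2)
n <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
outcome <- 1 + 0.5 * x1 + 0.3 * x2 + rnorm(n) # x2 genuinely matters here

m1 <- lm(outcome ~ x1)      # simpler model
m2 <- lm(outcome ~ x1 + x2) # more complex model, which nests m1
# A small Pr(>F) means the extra variance explained by x2 is beyond chance
anova(m1, m2)
```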

Advantages:

  • Relatively simple to code and run
  • Requires less data than training/testing sets
  • Provides a p-value, which is familiar

Disadvantages

  • Not very common anymore (because we have more data than we used to)
  • Can only compare nested models (where one model contains all of the other's variables)
    • e.g., Cannot compare rating ~ price to rating ~ accommodates, because neither model contains the other

That second disadvantage is pretty substantial. As a result, you won’t see this as much.

If I wanted to see whether a model with accommodates and price predicted review_scores_rating better than a model with accommodates alone, I could do the following.

First, let’s look at the variables we have:

Click to show code and output
hist(bnb$review_scores_rating)

hist(bnb$accommodates)

hist(bnb$price)
Error in hist.default(bnb$price): 'x' must be numeric

Whoa! price threw an error. This is because it is not numeric. It has that annoying $ before it, and also has a , when prices are in the thousands. We’ll need to remove those with the following code:

Pre-Class Q4

Just copy this into R.

Click to show code and output
# First, remove the dollar sign
bnb$price <- str_remove(bnb$price, "\\$")
# You need the "\\" because $ is a special character in regular expressions, but we just want the literal symbol gone.

# Second, remove the comma
bnb$price <- str_remove(bnb$price, ",")

# Third, make this a number
bnb$price <- as.numeric(bnb$price)
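If you do not have stringr installed, base R's gsub() does the same cleaning (this is an equivalent alternative, not the assigned approach; the prices vector below is made up):

```r
# Hypothetical raw price strings, for illustration only
prices <- c("$1,250.00", "$85.00")
prices <- gsub("\\$", "", prices) # $ must be escaped, just like with str_remove()
prices <- gsub(",", "", prices)   # remove the thousands comma
as.numeric(prices)
# [1] 1250   85
```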

Pre-Class Q5

See whether a model with accommodates and price predicted review_scores_rating better than a model with accommodates alone.

Click to show code and output
model_simple <- lm(data = bnb, review_scores_rating ~ accommodates)
model_complex <- lm(data = bnb, review_scores_rating ~ accommodates + price)
anova(model_simple, model_complex)
Analysis of Variance Table

Model 1: review_scores_rating ~ accommodates
Model 2: review_scores_rating ~ accommodates + price
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1   6173 1192.0                                  
2   6172 1185.8  1     6.181 32.171 1.476e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

That p-value being extremely small tells me that the variance reduced by the more complex model was well worth the added complexity. This should make us confident that using the complex model will lead to better predictions in the future than using the simpler model.

Train/Test Splits

Another way we can do this model comparison is by splitting our data into two sets (training and testing), and seeing how well each model that we fit to the training set predicts the testing set.

Advantages:

  • Super common (especially in machine learning)
  • Very flexible: can use different variables, models, etc.
  • Rather than a p-value, we can directly compare the error between models
    • This is because the p-value is approximating what we are actually doing here

Disadvantages

  • Needs more data to be reliable
  • More complex to code

First: split our data into a training and a testing set

  • Usually, the training set is 70%-80% of the data.
  • We do this by randomly sampling our data with sample().
    • We should randomly sample the id, as this is unique to each row.
  • Because there is randomness, we will use set.seed() to get consistent results.
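As a quick aside, here is a toy demonstration of what set.seed() buys us (the vector 1:10 is arbitrary): the same seed gives the same "random" sample every time the script is run.

```r
set.seed(5)
first_draw <- sample(1:10, size = 3)
set.seed(5)
second_draw <- sample(1:10, size = 3)
# Resetting the seed makes the second draw match the first exactly
identical(first_draw, second_draw)
# [1] TRUE
```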

Pre-Class Q6

Create splits of bnb into training and testing data.

Click to show code and output
set.seed(28)
samp_size <- round(.7 * nrow(bnb)) # I am rounding this to get a whole number
train <- sample(bnb$id, size = samp_size, replace = FALSE)

bnb_train <- bnb[bnb$id %in% train,] # Index bnb where the id is inside train
bnb_test <- bnb[!(bnb$id %in% train),] # Index bnb where the id is NOT inside train

Second: train each model on the training set

Pre-Class Q7

Click to show code and output
model_simple_train <- lm(data = bnb_train, review_scores_rating ~ accommodates)
model_complex_train <- lm(data = bnb_train, review_scores_rating ~ accommodates + price)

Note: You need to use bnb_train as the data, not bnb. Also, if you look at model_simple and model_simple_train, you will see the coefficients are slightly different. Think about why this is.

Third: test each model on the testing set.

Pre-Class Q8

Do this by:

  1. Making predictions about review_scores_rating from each model.
  2. Calculating the squared error for each model.
  3. Comparing those numbers.
Click to show code and output
#a) predictions
bnb_test$prediction_simple <- predict(object = model_simple_train,
                             newdata = bnb_test, 
                             type = "response")

bnb_test$prediction_complex <- predict(object = model_complex_train,
                             newdata = bnb_test, 
                             type = "response")

#b) calculate error
bnb_test$error_simple <- (bnb_test$prediction_simple - bnb_test$review_scores_rating)^2

bnb_test$error_complex <- (bnb_test$prediction_complex - bnb_test$review_scores_rating)^2

#c) compare error
mean(bnb_test$error_simple, na.rm = T)
[1] 0.2002133
mean(bnb_test$error_complex, na.rm = T)
[1] 0.1999402
  • predict() spits out a prediction for each row in the new data when we specify type = "response".
    • This is why we can just attach it as a column.

In this case, because we are actually making predictions, we do not need p-values. We can just look at the reduction in error itself.

One thing you will notice is that the complex model barely reduced the error relative to the simple model. If that seems odd, you can try different things:

  • Cross-validation (described here)
    • This repeats the process we just did (splitting, training, testing) multiple times on the same set of data, potentially reducing the impact of randomness in our sampling
  • Variable selection
    • price is not normally distributed at all, so it may be introducing some weirdness into the results. You could see how the results change if you transform price somehow.
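To give you a sense of cross-validation, here is a rough sketch of 5-fold cross-validation on simulated data (the object names k, folds, and errors are my own, and this is not the assigned code):

```r
set.seed(10)
n <- 200
dat <- data.frame(x = rnorm(n))
dat$y <- 2 + 0.4 * dat$x + rnorm(n)

k <- 5
folds <- sample(rep(1:k, length.out = n)) # randomly assign each row to a fold
errors <- numeric(k)
for (i in 1:k) {
  train_i <- dat[folds != i, ] # train on everything outside fold i
  test_i  <- dat[folds == i, ] # test on fold i
  fit <- lm(y ~ x, data = train_i)
  pred <- predict(fit, newdata = test_i)
  errors[i] <- mean((pred - test_i$y)^2) # squared error, as above
}
mean(errors) # average test error across the 5 splits
```

Because every row gets a turn in the testing set, the average error is less sensitive to any one lucky (or unlucky) split.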

Actually Making Predictions

The simple model (from the anova() example) predicts review_scores_rating as the following function:

  • \(\text{review\_scores\_rating}_i = 4.68 + 0.027 \times \text{accommodates}_i\)
    • In this function, \(i\) denotes any individual listing.

The complex model predicts review_scores_rating as the following function:

  • \(\text{review\_scores\_rating}_i = 4.68 + 0.027 \times \text{accommodates}_i - 0.000016 \times \text{price}_i\)

Why does this matter?

With the simple model, we predict that the cheapest and the most expensive 4-person Airbnb would have the exact same ratings.

Pre-Class Q9

See these predictions with predict(). To use predict(), we need to create a new data.frame that has the exact same variables as the variables in our models.

Click to show code and output
# Minimum price for a four-person bnb
min_fourperson <- min(bnb[bnb$accommodates == 4, ]$price)
# Maximum price for a four-person bnb
max_fourperson <- max(bnb[bnb$accommodates == 4, ]$price)

new_data <- data.frame(
  accommodates = c(4, 4),
  price = c(min_fourperson, max_fourperson)
)

predict(object = model_simple, #Predict takes an "object" (a model)
        newdata = new_data, # and it takes "newdata"
        type = 'response') 
       1        2 
4.787026 4.787026 

Pre-Class Q10

With the complex model, we do not predict that the cheapest and the most expensive 4-person Airbnb would have the exact same ratings. On your own, I want you to make those predictions for the complex model.

Making variables behave

Now, if you look at the distribution of price, you will see that it is a bit… crazy. For example, the maximum price of a four-person Airbnb is $50,000 per night!

You will probably get better predictions if you enforce some normality on price. On your own, I would like you to:

Pre-Class Q11

  • Create a new variable called bnb$log_price, which will be the natural log of bnb$price (Google this).
  • Create a new model called model_complex_log, using log_price in place of price.
  • Use anova() to compare the simple model to this new complex model. Is it still better?


Final Questions

These questions are for added points/difficulty. They count towards the total score.

Pre-Class Q12

You are not going to be able to test the original model_complex against model_complex_log using anova(). But, you can test these two using the training/testing example from earlier. Try to do that here.

Pre-Class Q13

I would like you to challenge yourself here, by attempting to test a totally new set of models. Predict review_scores_rating with new variables, using both anova() and training/testing splits.

In Class

In class, we are going to try out making predictions of review_scores_rating using whatever variables we want. You will then hand in your best models as your homework.

If you just use the variables that are already in the data set, you will struggle. Instead, I suggest creating new ones, cleaning the ones you have, etc.

For example, you might want to distinguish places that accommodate more than four people categorically:

Click to show code and output
bnb$accommodates_5plus <- ifelse(bnb$accommodates > 4, 1, 0)

And then see if this predicts ratings:

Click to show code and output
model_5plus <- lm(data = bnb, review_scores_rating ~ accommodates_5plus)
summary(model_5plus)

Call:
lm(formula = review_scores_rating ~ accommodates_5plus, data = bnb)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.8582 -0.0361  0.1339  0.2439  0.2539 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)        4.746147   0.006202 765.315  < 2e-16 ***
accommodates_5plus 0.112090   0.014503   7.728 1.26e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4405 on 6173 degrees of freedom
  (1881 observations deleted due to missingness)
Multiple R-squared:  0.009583,  Adjusted R-squared:  0.009423 
F-statistic: 59.73 on 1 and 6173 DF,  p-value: 1.262e-14

You might also want to pull out some specific amenity. Any time you are handling text, make it lower case!

Click to show code and output
bnb$amenities <- tolower(bnb$amenities)
bnb$kitchen <- ifelse(str_detect(bnb$amenities, 'kitchen'), 1, 0)